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O ' Abstract 

^ ' Background: One-dimensional protein structures such as secondary structures or contact num- 

bers are useful for three-dimensional structure prediction and helpful for intuitive understanding of the 
sequence-structure relationship. Accurate prediction methods will serve as a basis for these and other 
purposes. 

Results: We implemented a program CRNPRED which predicts secondary structures, contact numbers 
^ ' and residue-wise contact orders. This program is based on a novel machine learning scheme called criti- 

cs , cal random networks. Unlike most conventional one-dimensional structure prediction methods which are 

based on local windows of an amino acid sequence, CRNPRED takes into account the whole sequence. 
CRNPRED achieves, on average per chain, Q3 — 81% for secondary structure prediction, and correlation 
coefficients of 0.75 and 0.61 for contact number and residue-wise contact order predictions, respectively. 
Conclusion: CRNPRED will be a useful tool for computational as well as experimental biologists who 
^^ [ need accurate one- dimensional protein structure predictions. 

]o ! Background 

^ ' One-dimensional (ID) structures of a protein are residue-wise quantities or symbols onto which some fea- 

^ , tures of the native three-dimensional (3D) structure are projected. ID structures are of interest for several 

reasons. For example, predicted secondary structures, a kind of ID structures, are often used to limit the 
conformational space to be searched in 3D structure prediction. Furthermore, it has recently been shown 
C^ ' that certain sets of the native (as opposed to predicted) ID structures of a protein contain sufficient infor- 

mation to recover the native 3D structure [1,2]. These ID structures are either the principal eigenvector of 
the contact map [1] or a set of secondary structures (SS), contact numbers (CN) and residue- wise contact 
orders (RWCO) [2]. Therefore, it is possible, at least in principle, to predict the native 3D structure by 
first predicting the ID structures, and then by constructing the 3D structure from these ID structures. 
ID structures are not only useful for 3D structure predictions, but also helpful for intuitive understanding 
of the correspondence between the protein structure and its amino acid sequence due to the residue-wise 
characteristics of ID structures. Therefore, accurate prediction of ID protein structures is of fundamental 
biological interest. 

Secondary structure prediction has a long history [3]. Almost all the modern predictors are based on 
position-specific scoring matrices (PSSM) and some kind of machine learning techniques such as neural 
networks or support vector machines. Currently the best predictors achieve Q3 of 77-79% [4,5]. The study 
of contact number prediction also started long time ago [6,7], but further improvements were made only 
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recently [8-10]. These recent methods are based on the ideas developed in SS predictions (i.e., PSSM and 
machine learning), and achieve a correlation cocfRcient of 0.68-0.73. 

Recently, we have developed a new method for accurately predicting SS, CN, and RWCO based on a 
novel machine learning scheme, critical random networks (CRN) [10]. In this paper, we briefly describe 
the formulation of the method, and recent improvements leading to even better predictions. The computer 
program for SS, CN, and RWCO prediction named CRNPRED has been developed for the convenience of 
the general user, and a web interface and source code are made available online. 

Implementation 

Definition of ID structures 

Secondary structures (SS): Secondary structures were defined by the DSSP program [11]. For three-state 
SS prediction, the simple encoding scheme (the so-called CK mapping) was employed [12]. That is, a 
helices (H), P strands {E), and other structures ("coils") defined by DSSP were encoded as H, E, and 
C, respectively. Note that we do not use the CASP-style conversion scheme (the so-called EHL mapping) 
in which DSSP's H, G (3io helix) and / (tt helix) are encoded as iJ, and DSSP's E and B (/? bridge) as 
E. We believe the CK mapping is more natural and useful for 3D structure predictions (e.g., geometrical 
restraints should be different between an a helix and a 3io helix). For SS prediction, we introduce feature 
variables (Ui^ ,yf ,yf) to represent each type of secondary structures at the i-th residue position, so that H 
is represented as (1, —1, —1), E as (—1, 1, —1), and C as (—1, —1, 1). 

Contact numbers (CN): Let Cij represent the contact map of a protein. Usually, the contact map is 
defined so that Ci_j = 1 if the i-th and j'-th residues are in contact by some definition, or Cij — 0, otherwise. 
As in our previous study, we slightly modify the definition using a sigmoid function. That is, 

C,,, =l/{l + cxp[zi;(r,,, -d)]} (1) 

where Vij is the distance between C/3 {Ca for glycines) atoms of the i-th and j-th residues, d = 12A is a 
cutoff distance, and w is a sharpness parameter of the sigmoid function which is set to 3 [2,8]. The rather 
generous cutoff length of 12A was shown to optimize the prediction accuracy [8]. The use of the sigmoid 
function enables us to use the contact numbers in molecular dynamics simulations [2]. Using the above 
definition of the contact map, the contact number of the i-th residue of a protein is defined as 

n^= E ^^^- (2) 

3-\i-3\>2 

The feature variable yi for CN is defined as j/i = Ui/ log L where L is the sequence length of a target 
protein. The normalization factor logL is introduced because we have observed that the contact number 
averaged over a protein chain is roughly proportional to logi, and thus division by this value removes the 
size-dependence of predicted contact numbers. 

Residue-wise contact orders (RWCO): RWCO was first introduced in [2]. This quantity measures the 
extent to which a residue makes long-range contacts in a native protein structure. Using the same notation 
as contact numbers, the RWCO of the i-th residue in a protein structure is defined by 

o^^ Y^ \i-j\C^^j. (3) 

j-\i-j\>2 

The feature variable yi for RWCO is defined as yi = Oi/L where L is the sequence length. Due to the similar 
reason as CN, the normalization factor L was introduced to remove the size-dependence of the predicted 
RWCOs (the RWCO averaged over a protein chain is roughly proportional to the chain length). 



Critical random networks 

Here we briefly describe the critical random network (CRN) method introduced in [10] which should be 
referred to for the details. Unlike most conventional methods for ID structure prediction [except for some 
including the bidirectional recurrent neural networks [5,13,14]], the CRN method takes the whole amino 
acid sequence into account. In the CRN method, an A^-dimensional state vector x^ is assigned to the i-th 
residue of the target sequence (we use N — 5000 throughout this paper). Neighboring state vectors along 
the sequence are connected via a random N x N orthogonal matrix W. This matrix is also block-diagonal 
with the size of blocks ranging uniformly randomly between 2 and 50. The input to the CRN is the position- 
specific scoring matrix (PSSM), U = (ui, • • ■ , u^) of the target sequence obtained by PSI-BLAST [15] (L is 
the sequence length of the target protein) . We impose that the state vectors satisfy the following equation 
of state: 

X, = tanh[/3Ty(xj_i + x^+i) + aVu,] (4) 

for i = 1, • • • , L where V is an A^ x 21 random matrix (the 21st component of Uj is always set to unity), 
and (3 and a are scalar parameters. The fixed boundary condition is imposed (xq = x^+i = 0). By setting 
P = 0.5, the system of state vectors is made to be near a critical point in a certain sense, and thus the range 
of site-site correlation is expected to be long when a is sufficiently small but finite [10] . In this way, each 
state vector implicitly incorporates long-range correlations. The ID structure of the i-th residue is predicted 
as a linear projection of a local window of the PSSM and the state vector obtained by solving Eq. 01 

M 21 N 

J/j = ^ ^ Dm,aUa,i+m + ^ EkXk,i (5) 

m— — M a—1 k—1 

where yi is the predicted quantity, and Dm,a and Ek are the regression parameters. In the first summation, 
each PSSM column is extended to include the "terminal" residue. Since Eq. I^lis a simple linear equation 
once the equation of state (Eq. 0J has been solved, learning the parameters Dm. a and E^ reduces to an 
ordinary linear regression problem. For SS prediction, the triple {y^ ,yf ,yf) is calculated simultaneously, 
and the SS class is predicted as argmax^gj^ ^ ^j y|. For the CN and RWCO prediction, real values are 
predicted (2-state prediction is also made for CN using the average CN for each residue type as the threshold 
for "exposed" or "buried" as in [16]). The half window size M is set to 9 for SS and CN predictions, and to 
26 for RWCO. 

Ensemble prediction 

Since the CRN-based prediction is parametrized by the random matrices W and V ^ slightly different predic- 
tions are obtained for different pairs of W and V . We can improve the prediction by taking the average over 
an ensemble of such different predictions. 20 CRN-based predictors were constructed using 20 sets of different 
random matrices W and V . CN and RWCO are predicted as uniform averages of these 20 predictions. 

For SS prediction, we employ further training. Let s^'" be the prediction results of the n-th predictor for 
ID structure t {H, E, C, CN, and RWCO) of the i-th residue. The second stage SS prediction is made by 
the following linear scheme: 

20 3 

yr = EE E ^".^™4"™ (6) 

n— 1 t m— — 3 

where ss = H, E, C, and Wn,t,m is the weight obtained from a training set. Finally, the feature variable 
for each SS class of the i-th residue is obtained by (yfli + 2?;|* -I- y|^i)/4. This last procedure was found 
particularly effective for improving the segment overlap (SOV) measure. 

Additional input 

Another improvement is the addition of the amino acid composition of the target sequence to the predictor [9] : 
The term J2a=i Eafa was added to Eq. where Fa is a regression parameter, and fa is the fraction of the 



amino acid type a. 

Training and test data set 

We carried out a 15-fold cross-validation test following exactly the same procedure and the same data set 
as the previous study [10]. In the data set, there are 680 protein domains, each of which represents a 
superfamily according to the SCOP database (version 1.65) [17]. This data set was randomly divided so 
that 630 domains were used for training and the remaining 50 domains for testing, and the random division 
was repeated 15 times. No pair of these domains belong to the same superfamily, and hence they are not 
expected to be homologous. Thus, the present benchmark is a very stringent one. 

For obtaining PSSMs by running PSI-BLAST, we use the UniReflOO (version 6.8) amino acid sequence 
database [18] containing some 3 million entries. Also the number of iterations in PSI-BLAST homology 
searches was reduced to 3 times from 10 used in the previous study. This especially increased the accuracy 
of SS predictions. These results are consistent with the study of [19]. 

Numerics 

One drawback of the CRN method is the computational time required for numerically solving the equation 
of state (Eq. ^. For that purpose, instead of the Gauss-Seidel-like method previously used, we implemented 
a successive over-relaxation method which was found to be much more efficient. 

Let i^ denote the stage of iteration. We set the initial value of the state vectors (with j/ = 0) as 

xf^ =tanh[ayu,]. (7) 

Then, for i = 1, ■ • ■ , L (in increasing order of i), we update the state vectors by 

^(2.+i) ^ xp"-' +c.{tanh[M/(x(^';+i' 

+x(f,)) + ayu,]-x,f'')}. (8) 

Next, we update them in the reverse order. That is, for i = L, ■ ■ ■ ,1 (in decreasing order of i), 

(2u+2) {2u+l) , r, 1 riT// (21^+1) 
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(2i/+2)^ I -j^ 1 (2I/+1) 



aVn^-^^^''}. (9) 



We then set v ^- i^ -I- 1, and iterate Eqs. lO and © until {x^} converges. The acceleration parameter of 
uj — \A was found effective. The convergence criterion is 



\ i=\ 

where || • Urat denotes the Euchdean norm. This criterion is much less stringent than previous study (10""^), 
but this does not affect the prediction accuracy significantly. Convergence is typically achieved within 10 to 
12 iterations for one protein. 

Results and Discussion 

There are two main ingredients for the improved one-dimensional protein structure prediction in the present 
study. First is the use of large-scale critical random networks of 5000 dimension and 20 ensemble predictors. 
Second is the use of a large sequence database (UniReflOO) for PSI-BLAST searches. As demonstrated 
in Table 1, the CRN method achieves remarkably accurate predictions. In comparison with the previous 



study [10] based on 2000-dimensional CRNs (10 ensemble predictors), the Q3 and SOV measures in SS 
predictions improved from 77.8% and 77.3% to 80.5% and 80.0%, respectively. Similarly, the average cor- 
relation coefficient improved from 0.726 to 0.746 for CN predictions, and from 0.601 to 0.613 for RWCO 
predictions. The 2-state predictions for CN yields, on average, Q2 = 76.8% per chain and 76.7% per residue, 
and Matthews' correlation coefficient of 0.533. 

A closer examination of the SS prediction results (Table 2) reveals the drastic improvement of /3 strand 
prediction from Qe — 61.9% to 69.3% (per residue). Although the values of Qc and Q^^ are slightly lower 
than in the previous study by 0.6-1.0%, the accuracies of other classes have improved by 2.5-4%. 

CRNPRED compares favorably with other secondary structure prediction methods. The widely used 
PSIPRED program [4,20] which is based on conventional feed-forward neural networks achieves Q3 of 78%. 
A more recently developed method. Porter, [5] which is based on bidirectional recurrent neural networks 
achieves Q^ of 79%. An even more intricate method based on bidirectional segmented-memory recurrent 
neural networks [14] shows an accuracy of Q3 = 73% (this rather low accuracy may be attributed to the 
small size of training set used). However, it should be reminded that these studies are based on different 
data sets for both training and testing as well as the definition of secondary structural categories. Therefore, 
these comparisons may not be very informative, but only give a rough estimation of relative performance. 

Regarding the contact number prediction, CRNPRED, achieving Cor = 0.75, is the most accurate method 
available today. The simple linear method [8] with multiple sequence alignment derived from the HSSP 
database [21] showed a correlation coefficient of 0.63. A more advanced method based on support vector 
machines (local window-based) achieves a correlation of 0.68 per chain [9]. 

It is known that the number of homologs found by the PSI-BLAST searches significantly affects the 
prediction accuracies [19]. We have examined this effect by plotting the accuracy measures for a given 
minimum number of homologs found by PSTBLAST (Fig. 1). For example, we see in Fig. 1 that, for those 
proteins with more than 100 homologs, the average Q3 for SS predictions is 82.2%. The effect of the number 
of homologs significantly depends on the type of ID structure. For SS prediction, Q3 steadily increases as the 
number of homologs increases up to 100, but it stays in the range between 82.0 and 82.4 until the minimum 
number of homologs reaches around 400, and then it starts to decrease. For CN prediction. Cor also increases 
steadily but more slowly, and it does not degrade when the minimum number of homologs reaches 500. This 
tendency implies that CN is more conservative than SS during protein evolution, which is consistent with 
previous observations [22,23]. On the contrary, RWCO exhibits a peculiar behavior. The value of Cor 
reaches its peak at the minimum number of homologs of 80 beyond which the value rapidly decreases. This 
indicates that RWCO is not evolutionarily well conserved. It was observed that the accuracies of SS and CN 
predictions constantly increased when the dimension of CRNs was increased from 2000 to 5000, but such 
was not the case for RWCO (data not shown). RWCO seems to be such delicate a quantity that it is very 
difficult to extract relevant information from the amino acid sequence. 

Finally, we note on practical applicability of predicted ID structures. We do not believe, at present, that 
the construction of a 3D structure purely from the predicted ID structures is practical, if possible at all, 
because of the limited accuracy of the RWCO prediction. However, SS and CN predictions are very accurate 
for many proteins so that they may already serve as valuable restraints for 3D structure predictions. Also, SS 
and CN predictions may be applied to domain identification often necessary for experimental determination 
of protein structures. CRNPRED has been proved useful for such a purpose [24]. Although of the limited 
accuracy, predicted RWCOs still exhibit significant correlations with the correct values. Since RWCOs 
reflect the extent to which a residue is involved in long-range contacts, predicted RWCOs may be useful for 
enumerating potentially structurally important residues. 

An interesting alternative application of the CRN framework is to regard the solution of the equation of 
state (Eq. ^ as an extended sequence profile. By so doing, it is straightforward to apply the solution to the 
profile-profile comparison for fold recognition [25]. Such an application may be also pursued in the future. 



Availability and Requirements 

Project name: CRNPRED 
Project home page: 



http://bioinforinatics.org/crnpred/ 



Operating system: UNIX- like OS (including Linux and Mac OS X). 

Programming language: C. 

Other requirements: zsh, PSI-BLAST (blastpgp), The UniReflOO amino acid sequence database. 

License: Public domain. 

Any restrictions to use by non-academics: None. 
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Figure 1 

Average accuracy measure for given minimum number of homofogs found by PSI-BLAST. From top to 
bottom: Qs of secondary structure predictions, Cor of contact number predictions, and Cor of residue-wise 
contact number predictions. 
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Tables 

Table 1 - Summary of average prediction accuracies per chain (median in paren- 
theses). 



ss 


Q3= 80.5% (81.6) 


SOV= 80.0% (81.1) 


CN 


Cor= 0.746 (0.768) 


DevA= 0.686 (0.670) 


RWCO 


Cor^ 0.613 (0.646) 


DevA= 0.877 (0.812) 



SS, Secondary structure prediction: Qs is the percentage of correct prediction.; SOY is the segment 
overlap measure [26]. 

CN, Contact number prediction: Cor is the Pearson's correlation coefficient between the predicted and 
native CNs; DevA is the RMS error normalized by the standard deviation of the native CN [8] . 
RWCO, Residue- wise contact order prediction: Cor and DevA are defined as for CN but calculated with 
predicted and native RWCOs. 

Table 2: Summary of per-residue accuracies for SS predictions. 



measure H E C 

~Qs 82?7 69^3 SilT 

QP-"" 84.4 78.9 78.3 

MC 0.754 0.674 0.645 

Qs'- The number of correctly predicted residues of the SS class s ~ H, E,C divided by the number of 
residues in the class in native structures. 

Qpre. rpj^g number of correctly predicted residues of the SS class s — H,E,C divided by the number of 
residues predicted as the corresponding class. 
MC: Matthews' correlation coefficient. 
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